RAPToR - Building References

This vignette is specifically focused on building references, which are needed to stage samples with RAPToR. For more general use of the package, see the "RAPToR" vignette.
Building references is one of the key aspects of RAPToR,
as a sample needs an appropriate reference to be staged. References are
broadly usable once built, so they are worth the trouble to set up.
Throughout this vignette, you will see the general workflow of building a reference, from selecting an appropriate dataset to validating a model for interpolation. Alongside the explanations, examples are given using the same dsaeschimann2017 and dshendriks2014 datasets as in the general usage vignette (the code to load these can be found at the end of the vignette).
Finally, a few more examples of reference building on different datasets will be included at the end of this document.
I hope this material will cover your reference-building needs.
Without a transcriptomic time-series spanning the developmental stages of your samples’ organism, I’m afraid there’s not much we can do! Thankfully, these time-series experiments are (increasingly) numerous in the literature and most model organisms will have some kind of data we can use. You may also make (or already have) your own time-series on hand.
There are a few databases you can download data from. The most well-known are the Gene Expression Omnibus (GEO) and ArrayExpress.
Both of these databases have APIs to fetch their data directly from R (e.g. the GEOquery package, as shown in the example dataset loading scripts).
Several points of the experimental design should be kept in mind when selecting data for a reference.
Bioinformatics is a field plagued by diverse and fast-changing sets
of IDs for genes, transcripts, etc. When you build a reference, you
should always convert it to IDs that are conventional and
stable. We like to use organism-specific IDs (e.g. WormBase for C. elegans: WBGene00016153, FlyBase for Drosophila: FBgn0010583).
The Ensembl BioMart and its associated R package biomaRt are very useful resources for getting gene, transcript, or probe ID lists for conversion.
Below is a code snippet to get gene IDs for drosophila with
biomaRt.
requireNamespace("biomaRt", quietly = TRUE)
# setup connection to ensembl
mart <- biomaRt::useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
# get list of attributes
droso_genes <- biomaRt::getBM(attributes = c("ensembl_gene_id",
"ensembl_transcript_id",
"external_gene_name",
"flybase_gene_id"),
mart = mart)
head(droso_genes)
#> ensembl_gene_id ensembl_transcript_id external_gene_name flybase_gene_id
#> 1 FBgn0015380 FBtr0330209 drl FBgn0015380
#> 2 FBgn0015380 FBtr0081195 drl FBgn0015380
#> 3 FBgn0010356 FBtr0088240 Taf5 FBgn0010356
#> 4 FBgn0266023 FBtr0343232 lncRNA:CR44788 FBgn0266023
#> 5 FBgn0250732 FBtr0091512 gfzf FBgn0250732
#> 6 FBgn0250732 FBtr0334671 gfzf FBgn0250732
When multiple probe or transcript IDs match a single gene ID, we usually sum-aggregate counts and mean-aggregate other expression values (microarray intensities, or already-processed RNAseq data such as TPM). This is taken care of by the format_ids() function.
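As an illustration of this aggregation logic, here is a base-R sketch with made-up transcript IDs (not format_ids() itself, which handles the bookkeeping for you):

```r
# Toy transcript-level matrix: 4 transcripts mapping to 2 genes (made-up IDs)
tx_counts <- matrix(c(10, 20,
                       5,  5,
                     100, 90,
                       1,  2),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("tx1", "tx2", "tx3", "tx4"),
                                    c("sample1", "sample2")))
tx2gene <- c(tx1 = "geneA", tx2 = "geneA", tx3 = "geneB", tx4 = "geneB")
grp <- tx2gene[rownames(tx_counts)]

# sum-aggregate counts per gene
gene_sum <- rowsum(tx_counts, group = grp)

# mean-aggregate (for microarray intensities or TPM)
gene_mean <- gene_sum / as.vector(table(grp))
```

Here gene_sum holds 15 and 25 for geneA, while gene_mean holds the per-transcript averages.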
It’s common practice to normalize expression datasets (e.g. to account for technical bias). You may deal with many different profiling technologies when building references, and may join multiple datasets together for a reference.
To stay as consistent as possible, we apply quantile-normalization on
our datasets regardless of source or type. For this, we use the
normalizeBetweenArrays() function from limma.
We also log-transform the data with \(log(X + 1)\).
dsaeschimann2017$g <- limma::normalizeBetweenArrays(dsaeschimann2017$g, method = "quantile")
dsaeschimann2017$g <- log1p(dsaeschimann2017$g)
dshendriks2014$g <- limma::normalizeBetweenArrays(dshendriks2014$g, method = "quantile")
dshendriks2014$g <- log1p(dshendriks2014$g)
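For intuition, quantile normalization maps each sample's values onto a common reference distribution (the mean of the sorted columns). Below is a minimal base-R sketch of the idea; limma's implementation additionally handles ties and missing values, so this is not a drop-in replacement:

```r
quantile_normalize <- function(mat) {
  ranks  <- apply(mat, 2, rank, ties.method = "first")  # per-sample ranks
  sorted <- apply(mat, 2, sort)                         # per-sample sorted values
  ref    <- rowMeans(sorted)                            # reference distribution
  apply(ranks, 2, function(r) ref[r])                   # map ranks onto reference
}

set.seed(1)
m <- matrix(rexp(20, rate = .1), nrow = 5)  # 5 "genes" x 4 "samples"
qn <- quantile_normalize(m)
# after normalization, all samples share the same sorted values
apply(qn, 2, sort)
```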
It’s good practice to take a look at what’s inside your variables before anything else.
dsaeschimann2017$g[1:5,1:4]
#> let.7.n2853._18hr let.7.n2853._20hr let.7.n2853._22hr let.7.n2853._24hr
#> WBGene00007063 7.501850 10.988212 10.45480 7.994587
#> WBGene00007064 8.023767 8.655388 14.21012 9.759401
#> WBGene00007065 15.919452 16.875057 15.23932 18.847718
#> WBGene00003525 1.416181 10.938876 13.42202 2.488798
#> WBGene00007067 1.765342 1.775650 2.77224 2.200257
head(dsaeschimann2017$p, n = 5)
#> title geo_accession organism_ch1 strain
#> GSM2113587 let.7.n2853._18hr GSM2113587 Caenorhabditis elegans let-7(n2853)
#> GSM2113588 let.7.n2853._20hr GSM2113588 Caenorhabditis elegans let-7(n2853)
#> GSM2113589 let.7.n2853._22hr GSM2113589 Caenorhabditis elegans let-7(n2853)
#> GSM2113590 let.7.n2853._24hr GSM2113590 Caenorhabditis elegans let-7(n2853)
#> GSM2113591 let.7.n2853._26hr GSM2113591 Caenorhabditis elegans let-7(n2853)
#> time in development:ch1 age
#> GSM2113587 18 hours 18
#> GSM2113588 20 hours 20
#> GSM2113589 22 hours 22
#> GSM2113590 24 hours 24
#> GSM2113591 26 hours 26
With time-series data, heatmaps or boxplots of sample-sample correlations can reveal outliers, and also show the clear correlation between samples at similar stages of development.
cor_dsaeschimann2017 <- cor(dsaeschimann2017$g, method = "spearman")
# Heatmap
ord <- order(dsaeschimann2017$p$age)
heatmap(cor_dsaeschimann2017[ord, ord], Colv = NA, Rowv = NA, scale = "none", keep.dendro = F, margins = c(1,1),
RowSideColors = as.numeric(dsaeschimann2017$p$strain[ord]), labRow = "", labCol = "")
par(xpd = T) # text may have to be tweaked to plot size
mtext(text = unique(dsaeschimann2017$p$age), side = 1, line = 4, at = seq(-.1,1.05, l = 11))
# Boxplot
boxplot(cor_dsaeschimann2017 ~ interaction(dsaeschimann2017$p$strain, dsaeschimann2017$p$age), col = 1:4, xaxt = "n",
ylab = "Spearman correlation", xlab = "age", at = seq(1,44, l = 55)[c(T,T,T,T,F)])
axis(side = 1, at = seq(2,42, l = 11), labels = unique(dsaeschimann2017$p$age))
legend(23,.86, fill = 1:4, legend = c("let-7", "lin-41", "let-7/lin-41", "N2"), bty = "n")
Plotting components (PCA or ICA) with respect to time is a good way to display general dynamics in the data. For PCA, we want to perform a non-scaled, centered PCA. Centering is done gene-wise, not sample-wise (hence the matrix rotation below).
pca_dsaeschimann2017 <- stats::prcomp(t(dsaeschimann2017$g), rank = 25,
center = TRUE, scale = FALSE)
par(mfrow = c(2,4))
invisible(
sapply(seq_len(8), function(i){
plot(dsaeschimann2017$p$age, pca_dsaeschimann2017$x[,i], lwd = 2, col = dsaeschimann2017$p$strain,
xlab = "age", ylab = "PC", main = paste0("PC", i))
# connect the dots
sapply(seq_along(levels(dsaeschimann2017$p$strain)), function(l){
s <- which(dsaeschimann2017$p$strain == levels(dsaeschimann2017$p$strain)[l])
points(dsaeschimann2017$p$age[s], pca_dsaeschimann2017$x[s,i], col = l,
type = 'l', lty = 2)
})
if(i == 1)
legend("topleft", bty = 'n', legend = c("let-7", "lin-41", "let-7/lin-41", "N2"),
pch = c(rep(1, 4)), lty = c(rep(NA, 4)), col = c(1:4), lwd = 3)
})
)
In this C. elegans larval development data, we see very consistent dynamics between the different strains. Also, PC2 and PC3 capture an oscillatory dynamic characteristic of the developmental molts of C. elegans.
Another approach is to plot a few random genes, which gives a first-hand look at the noise in the data.
set.seed(10) # for reproducibility
gtp <- sample(nrow(dsaeschimann2017$g), size = 4)
par(mfrow = c(1,4))
invisible(
sapply(gtp, function(i){
plot(dsaeschimann2017$p$age, dsaeschimann2017$g[i,], lwd = 2, col = dsaeschimann2017$p$strain,
xlab = "age", ylab = "GExpr", main = rownames(dsaeschimann2017$g)[i])
# connect the dots
sapply(seq_along(levels(dsaeschimann2017$p$strain)), function(l){
s <- which(dsaeschimann2017$p$strain == levels(dsaeschimann2017$p$strain)[l])
points(dsaeschimann2017$p$age[s], dsaeschimann2017$g[i,s], col = l,
type = 'l', lty = 2)
})
if(i == gtp[4])
legend("topleft", bty = 'n', legend = c("let-7", "lin-41", "let-7/lin-41", "N2"),
pch = c(rep(1, 4)), lty = c(rep(NA, 4)), col = c(1:4), lwd = 3)
})
)
Increasing the resolution of a profiling time series is a very unbalanced regression problem: we want to predict tens of thousands of dependent variables (genes) from only a few independent variables (time, batch, ...).
To model this large number of output variables (genes), our strategy is to project the data into a dimensionally-reduced space, interpolate there, and then re-project the data back to gene space. We do this with Principal Component Analysis (PCA) or Independent Component Analysis (ICA).
Both PCA and ICA perform the same type of linear transformation on the data (they just optimize different criteria). We get the following:
\[ X_{(m\times n)} = G_{(m\times c)}S^{T}_{(c\times n)} \] with \(X\), the matrix of \(m\) genes by \(n\) samples, \(G\) the gene loadings (\(m\) genes by \(c\) components) and \(S\) the sample scores (\(n\) samples by \(c\) components). When performing PCA (or ICA) on gene expression data, \(S\) is what’s usually plotted (e.g. PC1 vs. PC2) to see how samples are grouped in the component space. It’s what we plotted earlier in the section on observing data, for instance.
Alter, Brown, and Botstein (2000) demonstrated that singular value decomposition of gene expression data can be taken as “eigengenes”, giving a global picture of the expression landscape and dynamics with a few components. GEIMs use this property. We fit a model on the columns of \(S^T\) (eigengenes), predict in the component space, and reconstruct the gene expression data by a matrix product with the gene loadings.
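This project-interpolate-reconstruct loop can be sketched in base R on synthetic data. The sketch below is a simplified stand-in for ge_im(): a plain polynomial fit replaces the GAM/GLM, and all names are illustrative:

```r
set.seed(1)
n <- 20                             # samples
time <- seq(0, 1, length.out = n)
m <- 50                             # genes
# synthetic expression: each gene is a noisy curve of time
X <- t(sapply(seq_len(m), function(i) sin(2 * pi * time + i) + rnorm(n, sd = .1)))

pca <- prcomp(t(X), center = TRUE, scale. = FALSE)   # samples as rows

nc <- 2                             # components kept for interpolation
# fit one model per component (ge_im() would fit a GAM/GLM here)
fits <- lapply(seq_len(nc), function(ci) lm(pca$x[, ci] ~ poly(time, 3)))

# predict components on a finer time grid, then project back to gene space
new_time <- seq(0, 1, length.out = 100)
S_pred <- sapply(fits, predict, newdata = data.frame(time = new_time))
X_interp <- pca$rotation[, seq_len(nc)] %*% t(S_pred) + pca$center
dim(X_interp)                       # genes x interpolated timepoints
```

Note that keeping all components and all timepoints reproduces the original matrix exactly; interpolation quality then depends on how well the per-component models capture the dynamics.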
We’ve implemented 2 model types : Generalized Additive Models (GAMs,
the default) and Generalized Linear Models (GLMs). GAMs rely on the
gam() function of the mgcv
package, and GLMs on the glm() function of the
stats core package.
As you’ll see in the next section, a standard R formula will be
specified for the model. This formula can include any tools one can use
with gam() or glm(), most notably the variety
of polynomial or smoothing splines implemented through the
s() function of mgcv for GAMs. GLMs can also
use splines from the splines core package, such as
ns() for natural cubic splines.
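As a toy illustration of spline terms in standard R formulas (independent of RAPToR), here is a glm() fit with a natural cubic spline from the core splines package; the data and parameters are made up:

```r
library(splines)

set.seed(1)
age <- seq(1, 10, by = .25)
y <- sin(age) + rnorm(length(age), sd = .2)   # toy expression dynamic

# natural cubic spline of age with 4 degrees of freedom, fit with a GLM
fit <- glm(y ~ ns(age, df = 4))

# predict on a finer age grid
pred <- predict(fit, newdata = data.frame(age = seq(1, 10, length.out = 100)))
```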
Gene Expression Interpolation Models (GEIMs) are built with the ge_im() function, which outputs a geim object. This function takes three key arguments:

- X : the gene expression matrix of your time series (genes as rows, samples as columns)
- p : a dataframe of phenotypic data, samples as rows. This should include the age/time variable and any other covariates you want to include in the model (e.g. batch, strain)
- formula : the model formula. This should be a standard R formula, using terms found in p. It must start with X ~.

Another important argument is nc, the number of components to interpolate on.
For example, using the dsaeschimann2017 dataset we could build the following model.
m_dsaeschimann2017 <- ge_im(X = dsaeschimann2017$g, p = dsaeschimann2017$p,
formula = "X ~ s(age, bs = 'ts') + strain", nc = 32)
Note that a single model formula is specified and applied to all the components, but models are fitted independently on the components.
Additional parameters are detailed in the documentation of the
function ?ge_im().
Model predictions can be generated with the predict()
function, as for any standard R model.
There are 5 types of GEIMs:
method = "gam", dim_red = "pca") (default)method = "glm", dim_red = "pca")method = "gam", dim_red = "ica")method = "glm", dim_red = "ica")method = "limma")Our default GEIM is fitting GAMs on PCA components, which is a robust choice when applying a smoothing spline to the data.
PCA and ICA interpolation usually yield near-identical results. ICA
tends to outperform PCA when the data is very noisy. This is by design,
since ICA essentially performs signal extraction. It is however slower,
especially if nc is large.
The last option ("limma") corresponds to a solution that
makes no effort to reduce the dimensionality of the problem
(dim_red and nc arguments are ignored). As a
result, there is no information loss or bias introduced by dimension
reduction. This approach is however very sensitive to noise. A model is
fit with the lmFit() function of the limma
package (hence the option name).
Note that when using GAMs, there can be no interaction between terms
(by definition). It is possible to include a by argument to
the s() function, which essentially corresponds to separate
model fits on each level of the specified group variable (and thus
requires enough data to do so).
Model performance can be evaluated through multiple criteria. The
mperf() function computes the indices described below, when
given the data and model predictions.
In the formulas below, \(X\) corresponds to the input gene expression matrix (\(m\) genes as rows, \(n\) samples as columns) and \(\hat{X}\) to the model predictions. \(x_i\) corresponds to row \(i\) of matrix \(X\) and \(x_i^{(j)}\) to sample \(j\) of that row. This notation is derived from the general regression problem, where \(X^T\) corresponds to the set of \(m\) dependent variables to predict.
aCC : average Correlation Coefficient.\[ aCC = \frac{1}{m}\sum^{m}_{i=1}{CC_i} = \frac{1}{m}\sum^{m}_{i=1}{\cfrac{\sum^{n}_{j=1}{(x_i^{(j)}-\bar{x}_i)(\hat{x}_i^{(j)}-\bar{\hat{x}}_i)}}{\sqrt{\sum^{n}_{j=1}{(x_i^{(j)}-\bar{x}_i)^2}\,\sum^{n}_{j=1}{(\hat{x}_i^{(j)}-\bar{\hat{x}}_i)^2}}}} \]
aRE : average Relative Error.\[ a\delta = \frac{1}{m}\sum^{m}_{i=1}{\delta} = \frac{1}{m} \sum^{m}_{i=1} \frac{1}{n} \sum^{n}_{j=1} \cfrac{| x_i^{(j)} - \hat{x}_i^{(j)} | }{x_i^{(j)}} \]
MSE : Mean Squared Error.\[ MSE = \frac{1}{m} \sum^{m}_{i=1} \frac{1}{n} \sum^{n}_{j=1} (x_i^{(j)} - \hat{x}_i^{(j)} )^2 \]
aRMSE : average Root MSE.\[ aRMSE = \frac{1}{m}\sum^{m}_{i=1}{RMSE} = \frac{1}{m} \sum^{m}_{i=1} \sqrt{\cfrac{\sum^{n}_{j=1} (x_i^{(j)} - \hat{x}_i^{(j)} )^2}{n}} \]
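These indices are simple to compute by hand. The sketch below reproduces them on a toy matrix; mperf() itself may differ in how it handles edge cases such as zero-variance genes:

```r
set.seed(1)
# toy data: 3 "genes" x 4 "samples", plus slightly perturbed "predictions"
X    <- matrix(1:12, nrow = 3)
Xhat <- X + matrix(rnorm(12, sd = .1), nrow = 3)

# average Correlation Coefficient (per gene, then averaged)
aCC   <- mean(sapply(seq_len(nrow(X)), function(i) cor(X[i, ], Xhat[i, ])))
# average Relative Error
aRE   <- mean(abs(X - Xhat) / X)
# Mean Squared Error
MSE   <- mean((X - Xhat)^2)
# average Root MSE (per gene, then averaged)
aRMSE <- mean(sqrt(rowMeans((X - Xhat)^2)))
```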
Note that these indices are computed and averaged with respect to
variables (genes), not observations. mperf() outputs
either the overall (averaged) index, or values per-gene
(global parameter).
g_mp <- mperf(dsaeschimann2017$g, predict(m_dsaeschimann2017), is.t = T)
g_mp
#> $aCC
#> [1] 0.7977299
#>
#> $aRE
#> [1] 0.1301014
#>
#> $MSE
#> [1] 0.01431891
#>
#> $aRMSE
#> [1] 0.1196617
ng_mp <- mperf(dsaeschimann2017$g, predict(m_dsaeschimann2017), is.t = T, global = F)
ng_mp <- lapply(ng_mp, na.omit) # remove NAs (eg. 0 variance genes)
ng_mp$aRE <- ng_mp$aRE[ng_mp$aRE < Inf] # remove Inf values (/0)
It’s possible to check the model performance by looking at the index distributions over all genes, e.g. :
par(mfrow = c(2,2))
invisible(
sapply(names(ng_mp), function(idx){
rg <- range(na.omit(ng_mp[[idx]]))
# label position
if(idx == "aCC"){
pos <- 2
} else {
pos <- 4
}
# estimate density curve
d <- density(na.omit(ng_mp[[idx]]), from = rg[1], to = rg[2])
plot(d, main = paste0(gsub("a", "", idx, fixed = T), " density (", length(ng_mp[[idx]]), " genes)"),
xlab = idx, lwd = 2)
# display global value
abline(v = g_mp[[idx]], lty = 2, lwd = 2, col = "firebrick")
text(g_mp[[idx]], .9*max(d$y), pos = pos, labels = idx, font = 2, col = "firebrick")
})
)
By default, the number of components to interpolate is set to the number of samples. However, we recommend setting a cutoff on explained variance of PCA (or ICA) components.
For example, on the dsaeschimann2017 dataset, we set the threshold at \(99\%\):
nc <- sum(summary(pca_dsaeschimann2017)$importance[3,] < .99) + 1
nc
#> [1] 32
This threshold must be set in accordance with the noise in the data. For example, in very noisy data, would you consider that \(99\%\) of the variance in the dataset corresponds to meaningful information?
You can also define nc by plotting your components and
stopping after the components stop capturing meaningful variation
(dynamics) with respect to time/age. We define components with
‘intelligible dynamics’ with respect to time as those where a model fit
explains \(\gt0.5\) of the deviance
(noisy components with no dynamics have poor fits).
In very noisy data, you may have to keep very few components (\(<5\)) for the interpolation.
Choosing between different splines (and/or parameters) can be done with cross-validation (CV). The ge_imCV() function takes as input X, p, and a formula_list to test. Other parameters controlling the CV itself can also be given (e.g. training set size).
The default training/validation set ratio is cv.s = 0.8,
so \(80\%\) of the data is used to
build the model. When including (factor) covariates in the model, the
training set is built such that all groups are proportionately
represented in the training set (based on terms of the first formula in
the list). The number of repeats to do for the CV is defined by
cv.n.
The model type (GAM/GLM and PCA/ICA) is fixed for all formulas in one
call of ge_imCV().
ge_imCV() computes the indices of model performance with mperf(), excluding aCC due to computing time. Indices are computed both on the validation set (CV error) and on the training set (model performance).
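The stratified construction of the training set can be sketched in base R (a simplified stand-in for what ge_imCV() does internally; the variable names are illustrative):

```r
set.seed(1)
strain <- factor(rep(c("N2", "let-7"), each = 10))  # toy covariate, 20 samples
cv.s <- 0.8                                         # training set ratio

# sample cv.s of the samples within each covariate level,
# so that all groups stay proportionately represented
train_idx <- unlist(lapply(levels(strain), function(l) {
  idx <- which(strain == l)
  sample(idx, size = round(cv.s * length(idx)))
}))
valid_idx <- setdiff(seq_along(strain), train_idx)

table(strain[train_idx])   # both levels represented in the training set
```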
Below is an example choosing between 4 of the available GAM smooth types for the dsaeschimann2017 GEIM.
smooth_methods <- c("tp", "ts", "cr", "ps")
flist <- as.list(paste0("X ~ s(age, bs = \'", smooth_methods, "\') + strain"))
flist
#> [[1]]
#> [1] "X ~ s(age, bs = 'tp') + strain"
#>
#> [[2]]
#> [1] "X ~ s(age, bs = 'ts') + strain"
#>
#> [[3]]
#> [1] "X ~ s(age, bs = 'cr') + strain"
#>
#> [[4]]
#> [1] "X ~ s(age, bs = 'ps') + strain"
cv_dsaeschimann2017 <- ge_imCV(X = dsaeschimann2017$g, p = dsaeschimann2017$p, formula_list = flist,
cv.n = 20, nc = nc, nb.cores = 3)
#> CV on 4 models. cv.n = 20 | cv.s = 0.8
#>
#> ...Building training sets
#> ...Setting up cluster
#> ...Running CV
#> ...Cleanup and formatting
plot(cv_dsaeschimann2017, names = paste0("bs = ", smooth_methods), outline = F,
swarmargs = list(cex = .8))
From the plots above, we can see that the different splines perform similarly; all could work. We chose ts (a thin-plate regression spline with shrinkage), as it seems to minimize CV error without much impact on model performance.
Extra spline parameters can also be specified to the model. With
s(), you can give the spline’s basis dimension (\(\simeq\) number of knots) to use with the
k parameter. By default, the spline is a penalized
spline, so it will not necessarily use k knots, but it
will stay below that value. In our experience, the parameter estimation
done by gam() is usually sufficient and rarely requires
tweaking.
By setting fx = TRUE, a spline with k basis
dimension is forced. Note this fits a spline of dimension k
on all components, whereas the penalized spline will adjust.
The s() or choose.k documentation gives
further information.
Below are examples of spline parameter tweaking with the
dsaeschimann2017 data.
ks <- c(4,6,8,10)
flistk <- as.list(c(
"X ~ s(age, bs = 'cr') + strain",
paste0("X ~ s(age, bs = 'cr', k = ", ks , ") + strain"),
paste0("X ~ s(age, bs = 'cr', k = ", ks , ", fx=TRUE) + strain")
))
flistk
#> [[1]]
#> [1] "X ~ s(age, bs = 'cr') + strain"
#>
#> [[2]]
#> [1] "X ~ s(age, bs = 'cr', k = 4) + strain"
#>
#> [[3]]
#> [1] "X ~ s(age, bs = 'cr', k = 6) + strain"
#>
#> [[4]]
#> [1] "X ~ s(age, bs = 'cr', k = 8) + strain"
#>
#> [[5]]
#> [1] "X ~ s(age, bs = 'cr', k = 10) + strain"
#>
#> [[6]]
#> [1] "X ~ s(age, bs = 'cr', k = 4, fx=TRUE) + strain"
#>
#> [[7]]
#> [1] "X ~ s(age, bs = 'cr', k = 6, fx=TRUE) + strain"
#>
#> [[8]]
#> [1] "X ~ s(age, bs = 'cr', k = 8, fx=TRUE) + strain"
#>
#> [[9]]
#> [1] "X ~ s(age, bs = 'cr', k = 10, fx=TRUE) + strain"
cv_dsaeschimann2017k <- ge_imCV(X = dsaeschimann2017$g, p = dsaeschimann2017$p, formula_list = flistk,
cv.n = 20, nc = nc, nb.cores = 3)
#> CV on 9 models. cv.n = 20 | cv.s = 0.8
#>
#> ...Building training sets
#> ...Setting up cluster
#> ...Running CV
#> ...Cleanup and formatting
A ref object can be built from a GEIM using
make_ref(), specifying interpolation resolution and
relevant metadata.
On our dsaeschimann2017 example:
r_dsaeschimann2017 <- make_ref(m = m_dsaeschimann2017,
n.inter = 100, # interpolation resolution
t.unit = "h past egg-laying", # time unit
cov.levels = list("strain" = "N2"), # covariate levels to use for interpolation
metadata = list("organism" = "C. elegans", # any metadata
"profiling" = "whole-organism, bulk",
"technology" = "RNAseq"))
As for any R model, GEIMs have a predict() function, which can be used to manually output predictions either in component space or at the gene level. Doing so manually instead of through make_ref() can be useful for a deeper look at the model.
# first generate the new predictor data
n.inter <- 100 # nb of new timepoints
newdat <- data.frame(
age = seq(min(dsaeschimann2017$p$age), max(dsaeschimann2017$p$age), l = n.inter),
strain = rep("N2", n.inter) # we want to predict as N2
)
head(newdat)
#> age strain
#> 1 18.00000 N2
#> 2 18.20202 N2
#> 3 18.40404 N2
#> 4 18.60606 N2
#> 5 18.80808 N2
#> 6 19.01010 N2
# predict at gene level
pred_m_dsaeschimann2017 <- predict(m_dsaeschimann2017, newdata = newdat)
# predict at component level
pred_m_dsaeschimann2017_comp <- predict(m_dsaeschimann2017, newdata = newdat, as.c = TRUE)
After building a reference, we can check the interpolation results. Checking predictions against components lets you immediately see if some dynamics are mishandled by the model.
Don’t hope for perfect fits! Slight offsets are acceptable.
You may also notice some noisy components get “flattened”, with a null model fitted. These components can be left in or removed as they generally have little to no impact on interpolation at the gene level (representing a minuscule part of total variance in the data). This sometimes actually gets rid of unwanted variation.
Plotting a model and a reference object (or equivalent metadata) shows component interpolation, with deviance explained (DE) and relative error (RE) for each component (this information is also returned by the plot function). DE can be used to define components with “intelligible” dynamics (w.r.t. time), when \(DE>0.5\). In noisy data, this distinction can be useful to remove components which do not reflect meaningful developmental variation (but rather noise).
Predictions of the first few components from
dsaeschimann2017 are plotted below.
par(mfrow = c(2,4))
fit_vals <- plot(m_dsaeschimann2017, r_dsaeschimann2017, ncs=1:8, col = dsaeschimann2017$p$strain, col.i = 'royalblue')
head(fit_vals)
#> component.var.exp r2 deviance.expl relative.err
#> PC1 0.68673 0.9972253 0.9972253 0.1658575
#> PC2 0.09546 0.9230361 0.9230361 0.3898380
#> PC3 0.08096 0.8783276 0.8783276 1.2464485
#> PC4 0.03242 0.9249808 0.9249808 5.3703222
#> PC5 0.01875 0.8510695 0.8510695 1.6352080
#> PC6 0.01274 0.9249475 0.9249475 2.9281560
Of note, we are predicting model values as N2 (blue interpolation line). While all strains are shown on the plots above, some model parameters depend on the selected N2 strain.
The interpolation should translate well to the full expression matrix:
par(mfrow = c(1,4))
invisible(
sapply(gtp, function(i){ # gtp is from the earlier random gene plots
plot(dsaeschimann2017$p$age, dsaeschimann2017$g[i,], lwd = 2, col = dsaeschimann2017$p$strain,
xlab = "age", ylab = "GExpr", main = rownames(dsaeschimann2017$g)[i])
# connect the dots
sapply(seq_along(levels(dsaeschimann2017$p$strain)), function(l){
s <- which(dsaeschimann2017$p$strain == levels(dsaeschimann2017$p$strain)[l])
points(dsaeschimann2017$p$age[s], dsaeschimann2017$g[i,s], col = l,
type = 'l', lty = 2)
})
# add model prediction
points(r_dsaeschimann2017$time, r_dsaeschimann2017$interpGE[i, ], col = "royalblue", type = 'l', lwd = 3)
if(i == gtp[4])
legend("topleft", bty = 'n', legend = c("let-7", "lin-41", "let-7/lin-41", "N2"),
pch = c(rep(1, 4)), lty = c(rep(NA, 4)), col = c(1:4), lwd = 3)
})
)
Staging the samples used to build the reference on their interpolated version is a good first test.
Then, staging another time-series from the literature on your reference is the best validation, if such data exists. This external validation confirms that the interpolated dynamics indeed correspond to developmental processes and not a dataset-specific effect (which is unlikely, but not impossible).
For our example, we can use the dshendriks2014 dataset
for external validation.
ae_test_dsaeschimann2017 <- ae(dsaeschimann2017$g, r_dsaeschimann2017)
ae_test_dshendriks2014 <- ae(dshendriks2014$g, r_dsaeschimann2017)
par(mfrow = c(1,2))
rg <- range(c(ae_test_dsaeschimann2017$age.estimates[,1], dsaeschimann2017$p$age))
# Plot 1
plot(ae_test_dsaeschimann2017$age.estimates[,1]~dsaeschimann2017$p$age,
xlab = "Chronological age", ylab = "Estimated age (dsaeschimann2017)",
xlim = rg, ylim = rg,
main = "Chron. vs Estimated ages for dsaeschimann2017\n(on dsaeschimann2017 reference)", lwd = 2,
col = factor(dsaeschimann2017$p$strain))
# connect the dots
invisible(sapply(levels(factor(dsaeschimann2017$p$strain)), function(l){
s <- dsaeschimann2017$p$strain == l
points(ae_test_dsaeschimann2017$age.estimates[s,1]~dsaeschimann2017$p$age[s], type = 'l',
lty = 2, col = which(l==levels(factor(dsaeschimann2017$p$strain))))
}))
abline(a = 0, b = 1, lty = 3, lwd = 2) # x = y
legend("bottomright", legend = c("let-7", "lin-41", "let-7/lin-41", "N2", "x = y"),
lwd=3, col=c(1:4, 1), bty='n', pch = c(1,1,1,1,NA), lty = c(rep(NA, 4), 3))
# Plot 2
rg <- range(c(ae_test_dshendriks2014$age.estimates[,1], dshendriks2014$p$age))
plot(ae_test_dshendriks2014$age.estimates[,1]~dshendriks2014$p$age,
xlab = "Chronological age", ylab = "Estimated age (dsaeschimann2017)",
xlim = rg, ylim = rg,
main = "Chron. vs Estimated ages for dshendriks2014\n(on dsaeschimann2017 reference)", lwd = 2)
# connect the dots
points(ae_test_dshendriks2014$age.estimates[,1] ~ dshendriks2014$p$age, type = 'l', lty = 2)
abline(a = 0, b = 1, lty = 3, lwd = 2) # x = y
legend("bottomright", legend = "x = y", lwd=3, col=1, lty = 3, bty='n')
Here you will find a few examples of reference building on different organisms.
The data of this example is used for the in-text examples throughout the reference-building vignette.
Here, we use two C. elegans RNAseq time-series:

- dsaeschimann2017. This is the data we use to build the reference. (Accession: GSE80157)
- dshendriks2014. This data is used for external validation. (Accession: GSE52861)

Code to generate dsaeschimann2017 and dshendriks2014:
Note: set the data_folder variable to an existing path on your system where you want to store the objects.
data_folder <- "../inst/extdata/"
requireNamespace("wormRef", quietly = T)
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # May need to be installed with bioconductor
requireNamespace("Biobase", quietly = T)
raw2tpm <- function(rawcounts, genelengths){
if(nrow(rawcounts) != length(genelengths))
stop("genelengths must match nrow(rawcounts).")
x <- rawcounts/genelengths
return(t( t(x) * 1e6 / colSums(x) ))
}
fpkm2tpm <- function(fpkm){
return(exp(log(fpkm) - log(colSums(fpkm)) + log(1e6)))
}
# dsaeschimann2017
geo_dsaeschimann2017 <- "GSE80157"
g_url_dsaeschimann2017 <- GEOquery::getGEOSuppFiles(geo_dsaeschimann2017, makeDirectory = FALSE, fetch_files = FALSE)
g_file_dsaeschimann2017 <- paste0(data_folder, "dsaeschimann2017.txt.gz")
utils::download.file(url = as.character(g_url_dsaeschimann2017$url[2]), destfile = g_file_dsaeschimann2017)
X_dsaeschimann2017 <- read.table(gzfile(g_file_dsaeschimann2017), h=T, sep = '\t', stringsAsFactors = F, row.names = 1)
# convert to tpm & wb_id
X_dsaeschimann2017 <- X_dsaeschimann2017[rownames(X_dsaeschimann2017)%in%wormRef::Cel_genes$wb_id,]
X_dsaeschimann2017 <- raw2tpm(rawcounts = X_dsaeschimann2017,
genelengths = wormRef::Cel_genes$transcript_length[match(rownames(X_dsaeschimann2017),
wormRef::Cel_genes$wb_id)])
# pheno data
P_dsaeschimann2017 <- Biobase::pData(GEOquery::getGEO(geo_dsaeschimann2017, getGPL = F)[[1]])
P_dsaeschimann2017[,10:34] <- NULL
P_dsaeschimann2017[, 3:8] <- NULL
colnames(P_dsaeschimann2017)[4] <- "strain"
P_dsaeschimann2017$strain <- factor(P_dsaeschimann2017$strain)
P_dsaeschimann2017$title <- make.names(P_dsaeschimann2017$title)
colnames(X_dsaeschimann2017) <- gsub('RNASeq_riboM_', '', colnames(X_dsaeschimann2017), fixed = T)
P_dsaeschimann2017$title <- gsub('RNASeq_riboM_', '', P_dsaeschimann2017$title, fixed = T)
# get age
P_dsaeschimann2017$age <- as.numeric(sub('(\\d+)\\shours', '\\1', P_dsaeschimann2017$`time in development:ch1`))
X_dsaeschimann2017 <- X_dsaeschimann2017[, P_dsaeschimann2017$title]
dsaeschimann2017 <- list(g = X_dsaeschimann2017, p = P_dsaeschimann2017)
save(dsaeschimann2017, file = paste0(data_folder, "dsaeschimann2017.RData"), compress = "xz")
# cleanup
file.remove(g_file_dsaeschimann2017)
rm(geo_dsaeschimann2017, g_url_dsaeschimann2017, g_file_dsaeschimann2017, X_dsaeschimann2017, P_dsaeschimann2017)
# dshendriks2014
geo_dshendriks2014 <- "GSE52861"
g_url_dshendriks2014 <- GEOquery::getGEOSuppFiles(geo_dshendriks2014, makeDirectory = FALSE, fetch_files = FALSE)
g_file_dshendriks2014 <- paste0(data_folder, "dshendriks2014.txt.gz")
utils::download.file(url = as.character(g_url_dshendriks2014$url[2]), destfile = g_file_dshendriks2014)
X_dshendriks2014 <- read.table(gzfile(g_file_dshendriks2014), h=T, sep = '\t', stringsAsFactors = F, row.names = 1)
# convert to tpm & wb_id
X_dshendriks2014 <- X_dshendriks2014[rownames(X_dshendriks2014)%in%wormRef::Cel_genes$wb_id,]
X_dshendriks2014 <- raw2tpm(rawcounts = X_dshendriks2014,
genelengths = wormRef::Cel_genes$transcript_length[match(rownames(X_dshendriks2014),
wormRef::Cel_genes$wb_id)])
# pheno data
P_dshendriks2014 <- Biobase::pData(GEOquery::getGEO(geo_dshendriks2014, getGPL = F)[[1]])
# filter relevant fields/samples
P_dshendriks2014 <- P_dshendriks2014[(P_dshendriks2014$`strain:ch1` == 'N2') & (P_dshendriks2014$`growth protocol:ch1` == 'Continuous'), ]
P_dshendriks2014 <- P_dshendriks2014[, c("title", "geo_accession", "time in development:ch1")]
# get age
P_dshendriks2014$age <- as.numeric(sub('(\\d+)\\shours', '\\1', P_dshendriks2014$`time in development:ch1`))
# formatting
P_dshendriks2014$title <- gsub('RNASeq_polyA_', '',
gsub('hr', 'h',
gsub('-', '.', fixed = T, as.character(P_dshendriks2014$title))))
colnames(X_dshendriks2014) <- gsub('RNASeq_polyA_','', colnames(X_dshendriks2014))
X_dshendriks2014 <- X_dshendriks2014[, P_dshendriks2014$title]
dshendriks2014 <- list(g = X_dshendriks2014, p = P_dshendriks2014)
save(dshendriks2014, file = paste0(data_folder, "dshendriks2014.RData"), compress = "xz")
# cleanup
file.remove(g_file_dshendriks2014)
rm(geo_dshendriks2014, g_url_dshendriks2014, g_file_dshendriks2014, X_dshendriks2014, P_dshendriks2014)
dsaeschimann2017$g <- limma::normalizeBetweenArrays(dsaeschimann2017$g, method = "quantile")
dsaeschimann2017$g <- log1p(dsaeschimann2017$g)
dshendriks2014$g <- limma::normalizeBetweenArrays(dshendriks2014$g, method = "quantile")
dshendriks2014$g <- log1p(dshendriks2014$g)
dsaeschimann2017$g[1:5,1:4]
#> let.7.n2853._18hr let.7.n2853._20hr let.7.n2853._22hr let.7.n2853._24hr
#> WBGene00007063 2.1206604 2.469532 2.373273 2.175924
#> WBGene00007064 2.1621558 2.260804 2.661102 2.354485
#> WBGene00007065 2.7763061 2.847833 2.727037 2.960098
#> WBGene00003525 0.9434159 2.466223 2.609585 1.313603
#> WBGene00007067 1.0787531 1.081964 1.350796 1.236899
head(dsaeschimann2017$p, n = 5)
#> title geo_accession organism_ch1 strain
#> GSM2113587 let.7.n2853._18hr GSM2113587 Caenorhabditis elegans let-7(n2853)
#> GSM2113588 let.7.n2853._20hr GSM2113588 Caenorhabditis elegans let-7(n2853)
#> GSM2113589 let.7.n2853._22hr GSM2113589 Caenorhabditis elegans let-7(n2853)
#> GSM2113590 let.7.n2853._24hr GSM2113590 Caenorhabditis elegans let-7(n2853)
#> GSM2113591 let.7.n2853._26hr GSM2113591 Caenorhabditis elegans let-7(n2853)
#> time in development:ch1 age
#> GSM2113587 18 hours 18
#> GSM2113588 20 hours 20
#> GSM2113589 22 hours 22
#> GSM2113590 24 hours 24
#> GSM2113591 26 hours 26
pca_dsaeschimann2017 <- stats::prcomp(t(dsaeschimann2017$g), rank = 25,
center = TRUE, scale = FALSE)
nc <- sum(summary(pca_dsaeschimann2017)$importance[3,] < .99) + 1
nc
#> [1] 32
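The rule above keeps the smallest number of components whose cumulative variance proportion reaches 99%. A self-contained illustration on a toy matrix (random data, not the real dataset):

```r
set.seed(1)
toy <- matrix(rnorm(50 * 20), nrow = 50)  # 50 "samples" x 20 "genes"
pca_toy <- stats::prcomp(toy, center = TRUE, scale = FALSE)
# row 3 of the importance table is the cumulative proportion of variance
cumvar <- summary(pca_toy)$importance[3, ]
nc_toy <- sum(cumvar < .99) + 1  # first component at/past the 99% mark
```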
m_dsaeschimann2017 <- ge_im(X = dsaeschimann2017$g, p = dsaeschimann2017$p,
formula = "X ~ s(age, bs = 'ts') + strain", nc = nc)
#> aCC aRE MSE aRMSE
#> 0.7977299 0.1301014 0.01431891 0.1196617
r_dsaeschimann2017 <- make_ref(m = m_dsaeschimann2017,
n.inter = 100, # interpolation resolution
t.unit = "h past egg-laying", # time unit
cov.levels = list("strain" = "N2"), # covariate levels to use for interpolation
metadata = list("organism" = "C. elegans", # any metadata
"profiling" = "whole-organism, bulk",
"technology" = "RNAseq"))
Here, we use two Drosophila melanogaster embryo development time-series:

- dsgraveley2011: used to build the reference (data downloaded from fruitfly.org).
- dslevin2016dmel: used for external validation (Accession: GSE60471).

Of note, the reference data has a low time resolution, which showcases the effectiveness of gene expression interpolation.

Code to generate dsgraveley2011 and dslevin2016dmel:
Note: set the data_folder variable to an existing path on your system where you want to store the objects.
data_folder <- "../inst/extdata/"
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # may need to be installed from Bioconductor
requireNamespace("Biobase", quietly = T)
raw2tpm <- function(rawcounts, genelengths){
if(nrow(rawcounts) != length(genelengths))
stop("genelengths must match nrow(rawcounts).")
x <- rawcounts/genelengths
return(t( t(x) * 1e6 / colSums(x) ))
}
fpkm2tpm <- function(fpkm){
  # rescale each sample (column) to sum to 1e6
  return(t( t(fpkm) * 1e6 / colSums(fpkm) ))
}
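raw2tpm length-normalizes the counts, then rescales each sample (column) to sum to one million. A quick sanity check on toy counts (values made up; the helper is repeated so the snippet is self-contained):

```r
# repeating raw2tpm from above for a self-contained check
raw2tpm <- function(rawcounts, genelengths){
  x <- rawcounts / genelengths   # length-normalize each gene
  t( t(x) * 1e6 / colSums(x) )   # scale each sample to 1e6
}

counts <- matrix(c(10, 200, 30, 5, 100, 60), nrow = 3,
                 dimnames = list(c("gA", "gB", "gC"), c("s1", "s2")))
lens <- c(1000, 2000, 500)
tpm <- raw2tpm(counts, lens)
# each column of a TPM matrix sums to one million
```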
requireNamespace("biomaRt", quietly = TRUE)
mart <- biomaRt::useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
droso_genes <- biomaRt::getBM(attributes = c("ensembl_gene_id",
"ensembl_transcript_id",
"external_gene_name",
"transcript_end", "transcript_start"),
mart = mart)
droso_genes$transcript_length <- droso_genes$transcript_end - droso_genes$transcript_start
droso_genes <- droso_genes[,c(1:3,6)]
colnames(droso_genes)[1:3] <- c("fb_id", "transcript_id", "gene_name")
rm(mart)
dsgraveley2011

g_url_dsgraveley2011 <- "ftp://ftp.fruitfly.org/pub/download/modencode_expression_scores/Celniker_Drosophila_Annotation_20120616_1428_allsamps_MEAN_gene_expression.csv.gz"
g_file_dsgraveley2011 <- paste0(data_folder, "dsgraveley2011.csv.gz")
utils::download.file(g_url_dsgraveley2011, destfile = g_file_dsgraveley2011)
X_dsgraveley2011 <- read.table(gzfile(g_file_dsgraveley2011), sep = ',', row.names = 1, h = T)
# convert gene ids to FBgn
X_dsgraveley2011 <- RAPToR::format_ids(X_dsgraveley2011, droso_genes, from = "gene_name", to = "fb_id")
# select embryo time series samples
X_dsgraveley2011 <- X_dsgraveley2011[,1:12]
P_dsgraveley2011 <- data.frame(sname = colnames(X_dsgraveley2011),
age = as.numeric(gsub("em(\\d+)\\.\\d+hr", "\\1", colnames(X_dsgraveley2011))),
stringsAsFactors = FALSE)
dsgraveley2011 <- list(g = X_dsgraveley2011, p = P_dsgraveley2011)
save(dsgraveley2011, file = paste0(data_folder, "dsgraveley2011.RData"), compress = "xz")
# cleanup
file.remove(g_file_dsgraveley2011)
rm(g_url_dsgraveley2011, g_file_dsgraveley2011, X_dsgraveley2011, P_dsgraveley2011)
dslevin2016dmel

geo_dslevin2016dmel <- "GSE60471"
g_url_dslevin2016dmel <- GEOquery::getGEOSuppFiles(geo_dslevin2016dmel, makeDirectory = FALSE, fetch_files = FALSE)
g_file_dslevin2016dmel <- paste0(data_folder, "dslevin2016dmel.txt.gz")
utils::download.file(url = as.character(g_url_dslevin2016dmel$url[3]), destfile = g_file_dslevin2016dmel)
X_dslevin2016dmel <- read.table(gzfile(g_file_dslevin2016dmel), h = T, sep = '\t', as.is = T, row.names = 1, comment.char = "")
# filter poor quality samples
cm_dslevin2016dmel <- RAPToR::cor.gene_expr(X_dslevin2016dmel, X_dslevin2016dmel)
f_dslevin2016dmel <- which(0.6 > apply(cm_dslevin2016dmel, 1, quantile, probs = .99))
X_dslevin2016dmel <- X_dslevin2016dmel[, -f_dslevin2016dmel]
# convert to tpm & FBgn
X_dslevin2016dmel <- X_dslevin2016dmel[rownames(X_dslevin2016dmel)%in%droso_genes$fb_id,]
X_dslevin2016dmel <- raw2tpm(rawcounts = X_dslevin2016dmel,
genelengths = droso_genes$transcript_length[match(rownames(X_dslevin2016dmel),
droso_genes$fb_id)])
# pheno data
P_dslevin2016dmel <- Biobase::pData(GEOquery::getGEO(geo_dslevin2016dmel, getGPL = F)[[1]])
# filter relevant fields/samples
P_dslevin2016dmel <- P_dslevin2016dmel[, c("title", "geo_accession", "time (minutes cellularization stage):ch1")]
colnames(P_dslevin2016dmel)[3] <- "time"
P_dslevin2016dmel$title <- as.character(P_dslevin2016dmel$title)
P_dslevin2016dmel <- P_dslevin2016dmel[P_dslevin2016dmel$title %in% colnames(X_dslevin2016dmel),]
X_dslevin2016dmel <- X_dslevin2016dmel[, P_dslevin2016dmel$title]
# formatting
P_dslevin2016dmel$title <- gsub('Metazome_Drosophila_timecourse_', '', P_dslevin2016dmel$title)
colnames(X_dslevin2016dmel) <- P_dslevin2016dmel$title
P_dslevin2016dmel$age <- as.numeric(P_dslevin2016dmel$time) / 60
dslevin2016dmel <- list(g = X_dslevin2016dmel, p = P_dslevin2016dmel)
save(dslevin2016dmel, file = paste0(data_folder, "dslevin2016dmel.RData"), compress = "xz")
# cleanup
file.remove(g_file_dslevin2016dmel)
rm(geo_dslevin2016dmel, g_url_dslevin2016dmel, g_file_dslevin2016dmel,
X_dslevin2016dmel, P_dslevin2016dmel,
cm_dslevin2016dmel, f_dslevin2016dmel)
rm(droso_genes, raw2tpm, fpkm2tpm)
dsgraveley2011$g <- limma::normalizeBetweenArrays(dsgraveley2011$g, method = "quantile")
dsgraveley2011$g <- log1p(dsgraveley2011$g)
dslevin2016dmel$g <- limma::normalizeBetweenArrays(dslevin2016dmel$g, method = "quantile")
dslevin2016dmel$g <- log1p(dslevin2016dmel$g)
dsgraveley2011$g[1:5, 1:5]
#> em0.2hr em2.4hr em4.6hr em6.8hr em8.10hr
#> FBgn0000003 3.9651391 4.2738527 3.3174101 4.5644242 4.6982706
#> FBgn0000008 1.2949845 0.9215699 0.6958672 0.6476801 0.7445991
#> FBgn0000014 0.5099295 0.9512866 1.3952815 1.8610406 1.8421960
#> FBgn0000015 0.2435639 0.6423988 1.0511912 1.1094674 1.0194280
#> FBgn0000017 1.7968429 2.0901351 1.3389420 1.5336183 1.6777064
head(dsgraveley2011$p, n = 5)
#> sname age
#> 1 em0.2hr 0
#> 2 em2.4hr 2
#> 3 em4.6hr 4
#> 4 em6.8hr 6
#> 5 em8.10hr 8
pca_dsgraveley2011 <- stats::prcomp(t(dsgraveley2011$g), rank = 12,
center = TRUE, scale = FALSE)
nc <- sum(summary(pca_dsgraveley2011)$importance[3,] < .99) + 1
nc
#> [1] 8
m_dsgraveley2011 <- ge_im(X = dsgraveley2011$g, p = dsgraveley2011$p, formula = "X ~ s(age, bs = 'cr')", nc = nc)
#> aCC aRE MSE aRMSE
#> 0.9400274 0.4040618 0.009908293 0.09954041
r_dsgraveley2011 <- make_ref(m = m_dsgraveley2011,
n.inter = 100, # interpolation resolution
t.unit = "h past egg-laying", # time unit
metadata = list("organism" = "D. melanogaster", # any metadata
"profiling" = "whole-organism, bulk",
"technology" = "RNAseq"))
ae_dsgraveley2011 <- ae(dsgraveley2011$g, r_dsgraveley2011)
ae_dslevin2016dmel <- ae(dslevin2016dmel$g, r_dsgraveley2011)
We notice here that the validation data's age estimates deviate from chronological age. This is due to the single-embryo nature of the data: inter-individual variability shows clearly, where it would be averaged out in bulk data. Furthermore, if we look at the dynamics of the dslevin2016dmel data, we'll see that the chronological age specified for the samples is inaccurate, as reflected by the noisy expression dynamics.
pca_dslevin2016dmel <- stats::prcomp(t(dslevin2016dmel$g), rank = 20,
center = TRUE, scale = FALSE)
This demonstrates one difficulty of producing high-resolution time series: developmental asynchrony between individual samples.
Here, we use two Danio rerio (zebrafish) embryo development time-series.
The reference data has uneven time sampling, as is often the case. We show a trick using ranks to build an adequate model and avoid interpolation bias.
The datasets are:

- dswhite2017: used to build the reference (data accessible in the publication).
- dslevin2016zeb: used for external validation (Accession: GSE60619).

Code to generate dswhite2017 and dslevin2016zeb:
Note: set the data_folder variable to an existing path on your system where you want to store the objects.
data_folder <- "../inst/extdata/"
requireNamespace("utils", quietly = T)
requireNamespace("GEOquery", quietly = T) # may need to be installed from Bioconductor
requireNamespace("Biobase", quietly = T)
raw2tpm <- function(rawcounts, genelengths){
if(nrow(rawcounts) != length(genelengths))
stop("genelengths must match nrow(rawcounts).")
x <- rawcounts/genelengths
return(t( t(x) * 1e6 / colSums(x) ))
}
fpkm2tpm <- function(fpkm){
  # rescale each sample (column) to sum to 1e6
  return(t( t(fpkm) * 1e6 / colSums(fpkm) ))
}
requireNamespace("biomaRt", quietly = TRUE)
mart <- biomaRt::useMart("ensembl", dataset = "drerio_gene_ensembl")
zeb_genes <- biomaRt::getBM(attributes = c("ensembl_gene_id", "transcript_length"),
mart = mart)
rm(mart)
dswhite2017

p_url_dswhite2017 <- "http://europepmc.org/articles/PMC5690287/bin/elife-30860-supp1.tsv"
g_url_dswhite2017 <- "http://europepmc.org/articles/PMC5690287/bin/elife-30860-supp2.tsv"
g_file_dswhite2017 <- paste0(data_folder, "dswhite2017.tsv")
utils::download.file(g_url_dswhite2017, destfile = g_file_dswhite2017)
X_dswhite2017 <- read.table(g_file_dswhite2017, h = T, sep ="\t", as.is = T, quote = "\"")
rownames(X_dswhite2017) <- X_dswhite2017$Gene.ID
X_dswhite2017 <- X_dswhite2017[,-(1:8)]
# convert to tpm & ensembl_id
X_dswhite2017 <- X_dswhite2017[rownames(X_dswhite2017)%in%zeb_genes$ensembl_gene_id,]
X_dswhite2017 <- raw2tpm(rawcounts = X_dswhite2017,
genelengths = zeb_genes$transcript_length[match(rownames(X_dswhite2017),
zeb_genes$ensembl_gene_id)])
# pheno data
P_dswhite2017 <- read.table(p_url_dswhite2017, h = T, sep = "\t", as.is = T)
P_dswhite2017 <- P_dswhite2017[P_dswhite2017$sequencing == "RNASeq", c("sample", "accession_number", "stage", "stageName", "sampleName")]
# timings of stages from the White et al. eLife (2017) publication of the data.
# time given in hours post-fertilization
timepoints <- data.frame(stage = unique(P_dswhite2017$stageName),
hours_pf = c(0, .75, 2.25, 3, 4.3, 5.25, 6, 8, 10.3,
16, 19, 24, 30, 36, 48, 72, 96, 120),
stringsAsFactors = F, row.names = "stage")
P_dswhite2017$age <- timepoints[P_dswhite2017$stageName, "hours_pf"]
P_dswhite2017$batch <- factor(gsub(".*-(\\d)$", "\\1", P_dswhite2017$sampleName))
X_dswhite2017 <- X_dswhite2017[, P_dswhite2017$sample]
dswhite2017 <- list(g = X_dswhite2017, p = P_dswhite2017)
save(dswhite2017, file = paste0(data_folder, "dswhite2017.RData"), compress = "xz")
# cleanup
file.remove(g_file_dswhite2017)
rm(p_url_dswhite2017, g_url_dswhite2017, g_file_dswhite2017, X_dswhite2017, P_dswhite2017, timepoints)
dslevin2016zeb

geo_dslevin2016zeb <- "GSE60619"
g_url_dslevin2016zeb <- GEOquery::getGEOSuppFiles(geo_dslevin2016zeb, makeDirectory = FALSE, fetch_files = FALSE)
g_file_dslevin2016zeb <- paste0(data_folder, "dslevin2016zeb.txt.gz")
utils::download.file(url = as.character(g_url_dslevin2016zeb$url[2]), destfile = g_file_dslevin2016zeb)
X_dslevin2016zeb <- read.table(gzfile(g_file_dslevin2016zeb), h = T, sep = '\t', as.is = T, row.names = 1, comment.char = "")
# convert to tpm & ensembl_id
X_dslevin2016zeb <- X_dslevin2016zeb[rownames(X_dslevin2016zeb)%in%zeb_genes$ensembl_gene_id,]
X_dslevin2016zeb <- raw2tpm(rawcounts = X_dslevin2016zeb,
genelengths = zeb_genes$transcript_length[match(rownames(X_dslevin2016zeb),
zeb_genes$ensembl_gene_id)])
# pheno data
P_dslevin2016zeb <- Biobase::pData(GEOquery::getGEO(geo_dslevin2016zeb, getGPL = F)[[1]])
# filter relevant fields/samples
P_dslevin2016zeb <- P_dslevin2016zeb[, c("title", "geo_accession", "time (min after fertilization):ch1")]
colnames(P_dslevin2016zeb)[3] <- "time"
P_dslevin2016zeb$title <- as.character(P_dslevin2016zeb$title)
P_dslevin2016zeb <- P_dslevin2016zeb[P_dslevin2016zeb$title %in% colnames(X_dslevin2016zeb),]
X_dslevin2016zeb <- X_dslevin2016zeb[, P_dslevin2016zeb$title]
# formatting
P_dslevin2016zeb$title <- gsub('Metazome_ZF_timecourse_', '', P_dslevin2016zeb$title)
colnames(X_dslevin2016zeb) <- P_dslevin2016zeb$title
P_dslevin2016zeb$age <- as.numeric(P_dslevin2016zeb$time) / 60
dslevin2016zeb <- list(g = X_dslevin2016zeb, p = P_dslevin2016zeb)
save(dslevin2016zeb, file = paste0(data_folder, "dslevin2016zeb.RData"), compress = "xz")
# cleanup
file.remove(g_file_dslevin2016zeb)
rm(geo_dslevin2016zeb, g_url_dslevin2016zeb, g_file_dslevin2016zeb, X_dslevin2016zeb, P_dslevin2016zeb)
rm(zeb_genes, raw2tpm, fpkm2tpm)
dswhite2017$g <- limma::normalizeBetweenArrays(dswhite2017$g, method = "quantile")
dswhite2017$g <- log1p(dswhite2017$g)
dslevin2016zeb$g <- limma::normalizeBetweenArrays(dslevin2016zeb$g, method = "quantile")
dslevin2016zeb$g <- log1p(dslevin2016zeb$g)
dswhite2017$g[1:5, 1:5]
#> zmp_ph133_B zmp_ph133_D zmp_ph133_E zmp_ph133_F zmp_ph133_G
#> ENSDARG00000000001 2.192007 2.019082 1.929426 2.031762 1.9166338
#> ENSDARG00000000002 1.149510 1.188959 0.900076 1.185358 0.9783448
#> ENSDARG00000000018 2.456661 2.534134 2.224970 2.364784 2.5503750
#> ENSDARG00000000019 4.432509 4.529970 4.608232 4.533400 4.5923212
#> ENSDARG00000000068 4.406696 4.460862 4.267657 4.294028 4.1594844
head(dswhite2017$p, n = 5)
#> sample accession_number stage stageName sampleName age batch
#> 1 zmp_ph133_B ERS1079239 Zygote:1-cell 1-cell 1-cell-1 0 1
#> 2 zmp_ph133_D ERS1079240 Zygote:1-cell 1-cell 1-cell-2 0 2
#> 3 zmp_ph133_E ERS1079241 Zygote:1-cell 1-cell 1-cell-3 0 3
#> 4 zmp_ph133_F ERS1079243 Zygote:1-cell 1-cell 1-cell-4 0 4
#> 5 zmp_ph133_G ERS1079244 Zygote:1-cell 1-cell 1-cell-5 0 5
pca_dswhite2017 <- stats::prcomp(t(dswhite2017$g), rank = 25,
center = TRUE, scale = FALSE)
Notice how the sampling is sparser towards the end of the time series, making the later dynamics appear "wider". Fitting splines on the components here would lead to a poor fit of the earlier, densely sampled timepoints.
To bypass this issue, we can use ranks instead of the timepoints.
# using data.table's rank function to get the "dense" tie method
dswhite2017$p$rank <- data.table::frank(dswhite2017$p$age, ties.method = "dense")
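"Dense" ranking assigns consecutive integers to distinct timepoints, so replicates share a rank. If data.table is not available, the same result can be obtained in base R with match() against the sorted unique ages; a toy check:

```r
# toy ages with replicates (made-up values)
age <- c(0, 0, 0.75, 0.75, 2.25, 3, 3, 3)
dense_rank <- match(age, sort(unique(age)))
# -> 1 1 2 2 3 4 4 4, matching data.table::frank(age, ties.method = "dense")
```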
These dynamics will be fitted much more cleanly. To predict the data in a uniform time scale, we can just pick values on the rank scale such that they translate to a uniform series on the age scale with a simple linear warp, as will be done below.
nc <- sum(summary(pca_dswhite2017)$importance[3,] < .99) + 1
nc
#> [1] 67
m_dswhite2017 <- ge_im(X = dswhite2017$g, p = dswhite2017$p, formula = "X ~ s(rank, bs = 'ds') + batch", nc = nc)
#> aCC aRE MSE aRMSE
#> 0.8593854 1.14166 0.03793716 0.1947746
We’ll be using a linear warp to get a uniform time series.
linwarp <- function(x, xyt, xc = 1, yc = 2){
# Computes a linear interpolation of input x to y value
# x = values of x to convert to y
# xyt = table with known sets of x/y
# xc, yc = column indices of x and y in xyt
if(min(x) < min(xyt[,xc]) | max(x) > max(xyt[,xc]))
stop("Some values of x are outside of the known x/y sets")
# set up y to x conversion table
xyt <- xyt[!duplicated.default(xyt[,xc]),]
xyt <- xyt[order(xyt[,xc]),]
xyt[,"dify"] <- c(0, diff(xyt[,yc]))
xyt[,"difx"] <- c(1, diff(xyt[,xc]))
xyt <- rbind(xyt[1,], xyt) # double 1st line for edge case
xout <- unlist(sapply(x, function(xi){
rsup <- which(xyt[-1,xc] >= xi)[1] + 1
xyt[rsup-1, yc] + (xi - xyt[rsup-1, xc])/xyt[rsup, "difx"] * xyt[rsup, "dify"]
}))
return(xout)
}
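linwarp() is essentially a piecewise-linear interpolation of rank as a function of age, so on de-duplicated (age, rank) pairs base R's approx() gives the same mapping; a toy check (values made up):

```r
# toy (age, rank) pairs, already de-duplicated as linwarp() does internally
xyt <- data.frame(age = c(0, 1, 3, 6), rank = c(1, 2, 3, 4))
ages_out <- c(0, 0.5, 2, 4.5, 6)
# linear interpolation of rank at the requested ages
ranks_out <- stats::approx(x = xyt$age, y = xyt$rank, xout = ages_out)$y
# -> 1.0 1.5 2.5 3.5 4.0
```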
# setup newdat
n.inter <- 200
newdat <- data.frame(age = seq(min(dswhite2017$p[, "age"]), max(dswhite2017$p[, "age"]), l = n.inter),
batch = rep("1", n.inter)) # predict as batch 1
# apply linwarp
newdat$rank <- linwarp(newdat$age, xyt = dswhite2017$p[, c("age", "rank")], xc = 1, yc = 2)
head(newdat)
#> age batch rank
#> 1 0.0000000 1 1.000000
#> 2 0.6030151 1 1.804020
#> 3 1.2060302 1 2.304020
#> 4 1.8090452 1 2.706030
#> 5 2.4120603 1 3.216080
#> 6 3.0150754 1 4.011596
# predict
pred_m_dswhite2017 <- predict(m_dswhite2017, newdata = newdat)
pred_m_dswhite2017_comp <- predict(m_dswhite2017, newdata = newdat, as.c = TRUE)
We want a uniform series on the age scale, but must input values on the rank scale into the model, which is why we use linwarp(). To give a sense of what the function did, we can plot the ranks against the age.
On the rank scale:
Back on the age scale:
# make a 'ref' object
r_dswhite2017 <- make_ref(m_dswhite2017,
n.inter = 10,
cov.levels = list("batch"="1"),
t.unit = "h post-fertilization", # time unit
metadata = list("organism" = "D. rerio", # any metadata
"profiling" = "whole-organism, single-embryo",
"technology" = "RNAseq",
"note" = "rank-trick interpolation"))
# replace the default results with our rank prediction
r_dswhite2017$interpGE <- pred_m_dswhite2017
r_dswhite2017$time <- newdat$age
# stage samples
ae_dswhite2017 <- ae(dswhite2017$g, r_dswhite2017)
ae_dslevin2016zeb <- ae(dslevin2016zeb$g, r_dswhite2017)
For comparison, we'll build the same model (not an optimal one!) without accounting for the uneven sampling. This will also serve to show interpolation issues to look out for.
m_dswhite2017_norank <- ge_im(X = dswhite2017$g, p = dswhite2017$p, formula = "X ~ s(age, bs = 'ds') + batch", nc = nc)
#> aCC aRE MSE aRMSE
#> 0.8201534 0.6431004 2.668498 1.633554
pred_m_dswhite2017_norank <- predict(m_dswhite2017_norank, newdata = newdat)
pred_m_dswhite2017_comp_norank <- predict(m_dswhite2017_norank, newdata = newdat, as.c = TRUE)
Let’s plot the components on the rank and age scales, as before.
Back on the age scale:
We can already see that the model struggles to predict the dynamics accurately: for example, it overfits in PC5 and flattens the dynamics in PC6.
The consequences show on the age estimates of external data (i.e. the validation data), but not necessarily on estimates of the reference data itself, as you'll see when we stage the samples. This stresses the importance of validating references with independent data.
# make a 'ref' object
r_dswhite2017_norank <- make_ref(m_dswhite2017_norank,
n.inter = 200,
cov.levels = list("batch"="1"),
t.unit = "h post-fertilization", # time unit
metadata = list("organism" = "D. rerio", # any metadata
"profiling" = "whole-organism, bulk",
"technology" = "RNAseq",
"note" = "NO rank-trick interpolation"))
ae_dswhite2017_norank <- ae(dswhite2017$g, r_dswhite2017_norank)
ae_dslevin2016zeb_norank <- ae(dslevin2016zeb$g, r_dswhite2017_norank)
The "gaps" or "steps" seen on the validation data's estimates are due to interpolation bias: the overfitting and flattened dynamics mentioned above. Model errors create local "unlikely" gene expression zones in the interpolation, which will not correlate well with samples of the corresponding age. These zones most often lie between original timepoints of the reference data, so the estimates of the original reference samples are unaffected. The validation data, however, shows clear "blank" ranges of age estimates, around which the samples of corresponding age cluster.
While the sub-optimal model we used showed clear red flags on the component plots, some forms of interpolation bias can be much more subtle. In such cases, bias can only be assessed through validation with external data.